Tagging and Parsing an Artificial Language: An Annotated Web-Corpus of Esperanto

Author

  • Eckhard Bick
Abstract

In the first half of this paper, we present and evaluate EspGram, a Constraint Grammar (CG)-based parser for the artificial language Esperanto. The second half of the paper describes the compilation and annotation of a corpus of 18.5 million words covering Esperanto literature, news text and web pages. As a planned language, conceived to be easy to learn and flexible to use, Esperanto has a highly regular morphology, where clearly perceived morphemes match linguistic categories almost one-to-one. Also, the core lexicon of the language was designed to avoid unnecessary ambiguity. Thus, morphological/lexematic ambiguity is almost entirely restricted to cross-compound ambiguity, and the average number of morphological readings is 1.12 per non-name word, as opposed to around 2.0 for most natural languages (depending on the way ambiguity is counted). Although the language has been allowed to evolve as a living system since its inception (Zamenhof, 1887), most changes have occurred at the lexical level, and the morphological system remains largely unchanged. On the other hand, the relatively free word order of the language, in combination with syntactic influence from different natural languages, has led to a language system and a speaker community very tolerant of syntactic variation, where norms are statistical rather than absolute. This situation has an important bearing on both parsing technology and corpus linguistics. First, with a reduced need for disambiguation, a part-of-speech tagger can be assumed to be almost identical to a morphological analyzer, while a syntactic parser will face a number of challenges. Second, a corpus of correct but international Esperanto may offer interesting insights into lexical and syntactic variation, reminiscent of the variation of non-native, international English, the difference being that in Esperanto, such variation is not stigmatized, but rather allowed or even supported by the flexibility of the language system.
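The ambiguity figure quoted in the abstract (readings per non-name word) can be illustrated with a minimal sketch. It assumes the figure is computed as the total number of morphological readings divided by the number of non-name tokens; the abstract itself notes that ambiguity can be counted in different ways. The `Token` type, the tag strings and the sample tokens are hypothetical, and are not EspGram's actual tagset or output.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str
    readings: List[str]  # hypothetical tag strings, not EspGram's tagset
    is_name: bool        # proper names are excluded from the average


def average_readings(tokens: List[Token]) -> float:
    """Mean number of morphological readings per non-name token."""
    counted = [t for t in tokens if not t.is_name]
    if not counted:
        raise ValueError("no non-name tokens to average over")
    return sum(len(t.readings) for t in counted) / len(counted)


# Toy sample: only "kolego" is ambiguous, as a root noun ("colleague")
# versus the derived reading kol+eg+o ("big neck").
SAMPLE = [
    Token("la", ["ART"], False),
    Token("Zamenhof", ["PROP"], True),
    Token("hundo", ["N SG NOM"], False),
    Token("kolego", ["N SG NOM <koleg>", "N SG NOM <kol+eg>"], False),
    Token("kuras", ["V PRES"], False),
]

if __name__ == "__main__":
    print(f"{average_readings(SAMPLE):.2f} readings per non-name word")
```

A different counting scheme, for example averaging over word types instead of tokens or including proper names, would give a different value for the same text.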

Similar resources

An improved joint model: POS tagging and dependency parsing

Dependency parsing is a form of syntactic parsing in natural language processing that automatically analyzes the dependency structure of a sentence and produces a dependency graph for each input sentence. Part-of-speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging together with dependency parsing in a pipeline mode. Unfortunately, in pipel...

Feature extraction in opinion mining through Persian reviews

Opinion mining deals with the analysis of user reviews to extract their opinions, sentiments and demands in a specific area, which can play an important role in making major decisions in that area. In general, opinion mining extracts user reviews at three levels: document, sentence and feature. Opinion mining at the feature level is taken into consideration more than the other two levels d...

A Comparative Study of the Effect of Part-of-Speech Tagging on Parsing in Automatic Processing of the Persian Language

In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...

An Annotated Corpus Management Tool: ChaKi

Large-scale annotated corpora are very important not only in linguistic research but also in practical natural language processing tasks, since a number of practical tools such as part-of-speech (POS) taggers and syntactic parsers are now corpus-based or machine learning-based systems which require some amount of accurately annotated corpora. This article presents an annotated corpus management t...

Portable Language Technology: a Resource-light Approach to Morpho-syntactic Tagging

Morpho-syntactic tagging is the process of assigning part of speech (POS), case, number, gender, and other morphological information to each word in a corpus. Morpho-syntactic tagging is an important step in natural language processing. Corpora that have been morphologically tagged are very useful both for linguistic research, e.g. finding instances or frequencies of particular constructions in...

Publication date: 2007